Introduction

Many colleges want to maximize the donations they receive from their alumni. In order to do so, they need to identify and predict the median salary and unemployment rate of recent graduates based on their education and other factors. With those predictions, colleges can direct more money into the programs likely to produce the largest return on their investments (students).

Business Question:

Where can colleges put money in order to optimize the amount of money they receive from recent graduates?

Analysis Question:

Based on recent graduates and their characteristics/education, what is their predicted median salary? Would they earn above or below $50,000?

Background Information

This data is pulled from the 2010-12 American Community Survey Public Use Microdata Series and is limited to respondents under the age of 28. The general purpose of this code and data is based upon this story, which describes the dilemma college students face when choosing a major, weighing the financial benefits of a field against the likelihood of finishing a degree in it. It breaks down overarching categories like “Engineering” and “STEM,” and dives deeper into what each major means for later financial stability and how popular it is compared to other majors. The dataset itself contains a detailed breakdown of earnings as well as labor-force information, taking into account sex and the type of job acquired post-graduation.

Process Overview

  1. Load & Clean Data
    1. Classify Variables Correctly
    2. One-Hot Encoding
  2. Exploratory Data Analysis
  3. Data Visualization
  4. Build a C5.0 Classification Model
    1. Prediction
    2. Evaluation
  5. Build a Random Forest Model
    1. Calculate mtry Level
    2. Optimize/Tune the Model
    3. Evaluation
  6. Fairness Assessment
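As a rough sketch, the load-and-clean step at the top of this list might look like the following (the file name `recent-grads.csv` is an assumption, not taken from the report):

```r
# Load the recent-graduates data (file name is an assumption)
majors_raw <- read.csv("recent-grads.csv", stringsAsFactors = FALSE)

# Drop rows with missing values; the str() output later in the report
# shows one omitted observation (row 22)
majors_raw <- na.omit(majors_raw)
```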

Data Cleaning

A brief look at the raw data can be found below.

## 'data.frame':    172 obs. of  21 variables:
##  $ Rank                : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Major_code          : int  2419 2416 2415 2417 2405 2418 6202 5001 2414 2408 ...
##  $ Major               : chr  "PETROLEUM ENGINEERING" "MINING AND MINERAL ENGINEERING" "METALLURGICAL ENGINEERING" "NAVAL ARCHITECTURE AND MARINE ENGINEERING" ...
##  $ Total               : int  2339 756 856 1258 32260 2573 3777 1792 91227 81527 ...
##  $ Men                 : int  2057 679 725 1123 21239 2200 2110 832 80320 65511 ...
##  $ Women               : int  282 77 131 135 11021 373 1667 960 10907 16016 ...
##  $ Major_category      : chr  "Engineering" "Engineering" "Engineering" "Engineering" ...
##  $ ShareWomen          : num  0.121 0.102 0.153 0.107 0.342 ...
##  $ Sample_size         : int  36 7 3 16 289 17 51 10 1029 631 ...
##  $ Employed            : int  1976 640 648 758 25694 1857 2912 1526 76442 61928 ...
##  $ Full_time           : int  1849 556 558 1069 23170 2038 2924 1085 71298 55450 ...
##  $ Part_time           : int  270 170 133 150 5180 264 296 553 13101 12695 ...
##  $ Full_time_year_round: int  1207 388 340 692 16697 1449 2482 827 54639 41413 ...
##  $ Unemployed          : int  37 85 16 40 1672 400 308 33 4650 3895 ...
##  $ Unemployment_rate   : num  0.0184 0.1172 0.0241 0.0501 0.0611 ...
##  $ Median              : int  110000 75000 73000 70000 65000 65000 62000 62000 60000 60000 ...
##  $ P25th               : int  95000 55000 50000 43000 50000 50000 53000 31500 48000 45000 ...
##  $ P75th               : int  125000 90000 105000 80000 75000 102000 72000 109000 70000 72000 ...
##  $ College_jobs        : int  1534 350 456 529 18314 1142 1768 972 52844 45829 ...
##  $ Non_college_jobs    : int  364 257 176 102 4440 657 314 500 16384 10874 ...
##  $ Low_wage_jobs       : int  193 50 0 0 972 244 259 220 3253 3170 ...
##  - attr(*, "na.action")= 'omit' Named int 22
##   ..- attr(*, "names")= chr "22"

As can be seen above, most of the columns are integer-valued. Several of these variables can be converted into factor variables alongside the numerical ones. In addition, the variables Rank, Major_code, and Major can be dropped: Rank is highly correlated with the salary variable (it is essentially a ranking by median salary), and the other two are too specific to generalize from.

# Add categorical targets and drop Rank, Major_code, and Major (columns 1-3);
# High.Unemployment flags majors with an unemployment rate above 0.5
majors_added_categorical <- majors_raw %>%
  mutate(Over.50K = ifelse(Median > 50000, "Over", "Under.Equal"),
         High.Unemployment = ifelse(Unemployment_rate > 0.5, "High", "Low")) %>%
  select(-1, -2, -3)

In addition, the Major_category levels can be collapsed into four broader buckets so the variable is more useful for the analysis.

## 
## Sciences     Arts    Other     STEM 
##       54       30       48       40
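The exact mapping from the original Major_category labels to the four buckets is not shown in the output; a sketch of how such a collapse might be written (the assignment of specific categories below is an assumption) is:

```r
library(dplyr)

# Collapse the original Major_category labels into four buckets.
# Which original label maps to which bucket is an assumption here.
majors_factors <- majors_added_categorical %>%
  mutate(Major_category = case_when(
    Major_category %in% c("Engineering", "Computers & Mathematics") ~ "STEM",
    Major_category %in% c("Biology & Life Science", "Physical Sciences",
                          "Health") ~ "Sciences",
    Major_category %in% c("Arts", "Humanities & Liberal Arts") ~ "Arts",
    TRUE ~ "Other"
  )) %>%
  mutate(across(c(Major_category, Over.50K, High.Unemployment), as.factor))

table(majors_factors$Major_category)
```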

In order to do some analysis, all categorical variables need to be one-hot encoded, which is done below:

# One Hot Encoded Data
majors_onehot <- one_hot(data.table(majors_factors), cols = c("Major_category", "High.Unemployment"))
# Normal Data
majors <- majors_factors
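mltools::one_hot expands each factor into 0/1 indicator columns. For comparison, the same effect can be approximated in base R with model.matrix (this is a sketch on a toy factor, not the report's code):

```r
# Base-R analogue of one-hot encoding a factor column.
# "~ 0 + f" drops the intercept so every level gets its own indicator column.
f <- factor(c("STEM", "Arts", "STEM", "Other"))
one_hot_base <- model.matrix(~ 0 + f)
colnames(one_hot_base)  # levels in alphabetical order: fArts, fOther, fSTEM
```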

Exploratory Data Analysis

Before beginning the analytical part of the exploration, it is beneficial to visualize and summarize the data in order to better understand it as a whole, with an emphasis on the variables believed to be important for the analysis.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   22000   33000   36000   40077   45000  110000
##           Total       Men     Women ShareWomen Sample_size  Employed Full_time
## Total 1.0000000 0.8780884 0.9447645  0.1429993   0.9455747 0.9962140 0.9893392
## Men   0.8780884 1.0000000 0.6727589 -0.1120136   0.8751756 0.8706047 0.8935631
## Women 0.9447645 0.6727589 1.0000000  0.2978321   0.8626064 0.9440365 0.9176812
##       Part_time Full_time_year_round Unemployed Unemployment_rate     Median
## Total 0.9502684            0.9811118  0.9747684        0.08319170 -0.1067377
## Men   0.7515917            0.8924540  0.8694115        0.10150234  0.0259906
## Women 0.9545133            0.9057195  0.9116943        0.05910776 -0.1828419
##             P25th       P75th College_jobs Non_college_jobs Low_wage_jobs
## Total -0.07192608 -0.08319767    0.8004648        0.9412471     0.9355096
## Men    0.03872518  0.05239290    0.5631684        0.8514998     0.7913360
## Women -0.13773826 -0.16452834    0.8519460        0.8721318     0.9044699

The table above is a correlation matrix detailing the correlation coefficients between the remaining variables and “Total,” “Men,” and “Women.” The correlation coefficient measures the strength of the relationship between two variables, with a magnitude close to 1 or -1 indicating a strong direct or inverse relationship. Based on the output, it is important to note the differences in “Employed” between men and women: the Women counts correlate more strongly with employment (~0.944) than the Men counts do (~0.871). Similarly, the Women counts correlate more strongly with part-time work (~0.955) than the Men counts do (~0.752). On the other hand, for the Median variable, which describes the median earnings of full-time year-round workers, Women shows a slight inverse relationship (~ -0.183) whereas Men shows a slight direct relationship (~0.026). This is an important dissimilarity: majors employing more women tend to pay less at the median.
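A correlation table like the one above can be produced with cor() over the numeric columns and then subset to the three rows of interest (variable names are taken from the str() output earlier):

```r
# Keep only numeric columns, dropping the identifier-like ones
num_cols <- majors_raw[sapply(majors_raw, is.numeric)]
num_cols$Rank <- NULL          # rank index, not a real measurement
num_cols$Major_code <- NULL    # arbitrary code identifier

# Correlations of every numeric variable with Total, Men, and Women
round(cor(num_cols)[c("Total", "Men", "Women"), ], 7)
```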

Data Visualization

Now, we can visualize the dataset. To do this, we used the ggplot2 and plotly packages.

As can be seen above, the first graph we created is a polar chart. A polar chart allows the reader to understand the sampling distribution, that is, how much representation each major category has in the dataset: the larger the slice, the more majors fall into that category. From the polar chart, Sciences has the largest amount of representation, followed closely by the Other category; STEM is third, and Arts is last.
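A polar chart of category representation can be drawn by wrapping a bar chart in coord_polar; the exact aesthetics used in the report are not shown, so this is a sketch:

```r
library(ggplot2)

# Bar chart of counts per Major_category, bent into a polar (pie-like) chart
p_polar <- ggplot(majors, aes(x = "", fill = Major_category)) +
  geom_bar(width = 1) +
  coord_polar(theta = "y") +
  labs(title = "Representation of Major Categories", x = NULL, y = NULL)
```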

The next graph we created was a stacked bar graph. The major category is on the x-axis, while the count, normalized to be between 0 and 1, is on the y-axis. The fill of the graph is based on whether a major in that category has a median salary above $50,000. From this graph, almost 50 percent of STEM majors make above $50K per year, the largest share of the four major categories. The other three categories are nowhere close to STEM: Other comes in second with about 7 percent of its majors above $50K, Sciences is third with what seems to be about 1 percent, and Arts is last with what seems to be 0 percent of its majors making above $50K per year.
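A normalized stacked bar chart like the one described can be produced with geom_bar(position = "fill"); the exact styling used in the report is assumed:

```r
library(ggplot2)

# Stacked bars normalized to 1 via position = "fill"; the fill shows the
# share of each category whose median salary is over $50K
p_bar <- ggplot(majors, aes(x = Major_category, fill = Over.50K)) +
  geom_bar(position = "fill") +
  labs(y = "Proportion of majors", fill = "Median > $50K")
```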

For our third graph, we made a box plot where the x-axis is the median salary and the y-axis is the four major categories. From this graph, it can be deduced that the spread of STEM salaries is larger than that of any other category: roughly $40-50K, whereas the other categories span at most about $30K. One STEM major (Petroleum Engineering) has a median salary of $110K, almost double the highest median salary in any other category. Another interesting aspect of the STEM box plot, compared to the other three, is that its 25th percentile sits at about $45K, which is higher than the 75th percentile of any other category. The other three box plots are relatively similar to each other, with the Arts box being much narrower than the other two; the narrower the box, the smaller the interquartile range.
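The box plot can be sketched as follows (the dollar-formatted axis is an assumption about the report's styling):

```r
library(ggplot2)

# Horizontal box plots: median salary (x) by major category (y)
p_box <- ggplot(majors, aes(x = Median, y = Major_category)) +
  geom_boxplot() +
  scale_x_continuous(labels = scales::dollar) +
  labs(x = "Median salary", y = "Major category")
```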

Our final graph above is a three-dimensional scatterplot. The unemployment rate, on a scale of 0-1, is on the x-axis; the share of women, as a decimal, is on the y-axis; and the low-wage-jobs variable is on the z-axis. The color of each marker depends on the major's median salary, using a gradient color scheme. From the graph, it can be seen that majors with a higher share of women tend to have lower median salaries, more low-wage jobs, and higher unemployment rates. Another interesting thing to note is that only one major overall has a median salary above $100K.
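The 3-D scatterplot described above could be built with plot_ly along these lines (mapping the marker color to Median is an assumption about the report's gradient):

```r
library(plotly)

# 3-D scatter: unemployment rate vs. share of women vs. low-wage jobs,
# with a color gradient driven by each major's median salary
p_3d <- plot_ly(majors,
                x = ~Unemployment_rate,
                y = ~ShareWomen,
                z = ~Low_wage_jobs,
                type = "scatter3d", mode = "markers",
                color = ~Median)
```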

Model Building: C5.0 Classification

## [1] 172  22
## [1] 121  22
## [1] 26 22
## [1] 25 22
## Classes 'data.table' and 'data.frame':   121 obs. of  21 variables:
##  $ Total                  : int  2339 756 856 1258 32260 2573 3777 1792 91227 81527 ...
##  $ Men                    : int  2057 679 725 1123 21239 2200 2110 832 80320 65511 ...
##  $ Women                  : int  282 77 131 135 11021 373 1667 960 10907 16016 ...
##  $ Major_category_Sciences: int  0 0 0 0 0 0 0 1 0 0 ...
##  $ Major_category_Arts    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Major_category_Other   : int  0 0 0 0 0 0 1 0 0 0 ...
##  $ Major_category_STEM    : int  1 1 1 1 1 1 0 0 1 1 ...
##  $ ShareWomen             : num  0.121 0.102 0.153 0.107 0.342 ...
##  $ Sample_size            : int  36 7 3 16 289 17 51 10 1029 631 ...
##  $ Employed               : int  1976 640 648 758 25694 1857 2912 1526 76442 61928 ...
##  $ Full_time              : int  1849 556 558 1069 23170 2038 2924 1085 71298 55450 ...
##  $ Part_time              : int  270 170 133 150 5180 264 296 553 13101 12695 ...
##  $ Full_time_year_round   : int  1207 388 340 692 16697 1449 2482 827 54639 41413 ...
##  $ Unemployed             : int  37 85 16 40 1672 400 308 33 4650 3895 ...
##  $ Unemployment_rate      : num  0.0184 0.1172 0.0241 0.0501 0.0611 ...
##  $ P25th                  : int  95000 55000 50000 43000 50000 50000 53000 31500 48000 45000 ...
##  $ P75th                  : int  125000 90000 105000 80000 75000 102000 72000 109000 70000 72000 ...
##  $ College_jobs           : int  1534 350 456 529 18314 1142 1768 972 52844 45829 ...
##  $ Non_college_jobs       : int  364 257 176 102 4440 657 314 500 16384 10874 ...
##  $ Low_wage_jobs          : int  193 50 0 0 972 244 259 220 3253 3170 ...
##  $ High.Unemployment_Low  : int  1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, ".internal.selfref")=<externalptr>
median_mdl
## C5.0 
## 
## 121 samples
##  21 predictor
##   2 classes: 'Over', 'Under.Equal' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 109, 109, 108, 110, 109, 110, ... 
## Resampling results across tuning parameters:
## 
##   model  winnow  trials  Accuracy   Kappa    
##   rules  FALSE    1      0.9103030  0.5955467
##   rules  FALSE   10      0.9159091  0.6193462
##   rules  FALSE   20      0.9209324  0.6407262
##   rules   TRUE    1      0.9150583  0.5344605
##   rules   TRUE   10      0.9079604  0.5216761
##   rules   TRUE   20      0.9144988  0.5592476
##   tree   FALSE    1      0.9127040  0.6083058
##   tree   FALSE   10      0.9193939  0.6468462
##   tree   FALSE   20      0.9243939  0.6600929
##   tree    TRUE    1      0.9117249  0.5339766
##   tree    TRUE   10      0.9146503  0.5458891
##   tree    TRUE   20      0.9163170  0.5668338
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were trials = 20, model = tree and winnow
##  = FALSE.

Based on the accuracies and the respective kappa values for each combination of trials and winnowing, the final model was chosen with 20 boosting trials, a tree model, and no winnowing. Across the grid, turning winnowing off consistently produced higher kappa values, and accuracy generally improved as the number of trials increased: the best no-winnowing tree model (20 trials) reached an accuracy of approximately 0.9244, compared to roughly 0.9117 for the winnowed tree with 1 trial. Therefore, no winnowing combined with a larger number of boosting trials is the most suitable option for constructing this model.

# plot the model
plot(median_mdl)

Graphically, the difference between no winnowing (FALSE) and winnowing (TRUE) across the number of boosting trials can be visualized. The FALSE curve sits noticeably higher in accuracy than the TRUE curve, and accuracy trends upward as the number of trials increases. This visualization supports the previous output, which identified 20 trials with no winnowing as the most favorable final model.

Prediction

## Confusion Matrix and Statistics
## 
##              Actual
## Prediction    Over Under.Equal
##   Over           2           1
##   Under.Equal    2          21
##                                           
##                Accuracy : 0.8846          
##                  95% CI : (0.6985, 0.9755)
##     No Information Rate : 0.8462          
##     P-Value [Acc > NIR] : 0.417           
##                                           
##                   Kappa : 0.5063          
##                                           
##  Mcnemar's Test P-Value : 1.000           
##                                           
##             Sensitivity : 0.50000         
##             Specificity : 0.95455         
##          Pos Pred Value : 0.66667         
##          Neg Pred Value : 0.91304         
##               Precision : 0.66667         
##                  Recall : 0.50000         
##                      F1 : 0.57143         
##              Prevalence : 0.15385         
##          Detection Rate : 0.07692         
##    Detection Prevalence : 0.11538         
##       Balanced Accuracy : 0.72727         
##                                           
##        'Positive' Class : Over            
## 

From the generated confusion matrix, the most useful metrics for analysis are the accuracy coupled with the F1 score. The goal is to have accuracy as close to 1 as possible; the value of 0.8846 is decent, though it only slightly exceeds the no-information rate of 0.8462, so there is room for improvement. The F1 score balances precision and recall for the “Over” class; the value of 0.5714 indicates that the model struggles on the minority class and, like the accuracy, should be improved to move closer to 1.
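The headline numbers can be reproduced directly from the confusion-matrix counts (TP = 2, FP = 1, FN = 2, TN = 21, with “Over” as the positive class):

```r
TP <- 2; FP <- 1; FN <- 2; TN <- 21  # counts from the confusion matrix

accuracy  <- (TP + TN) / (TP + FP + FN + TN)                # 23/26 = 0.8846
precision <- TP / (TP + FP)                                 # 2/3   = 0.6667
recall    <- TP / (TP + FN)                                 # 2/4   = 0.5
f1        <- 2 * precision * recall / (precision + recall)  # 0.5714
```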

Evaluation

## C5.0 variable importance
## 
##   only 20 most important variables shown (out of 21)
## 
##                         Overall
## Men                      100.00
## Major_category_Other     100.00
## P75th                    100.00
## P25th                    100.00
## Major_category_STEM       99.17
## Unemployment_rate         97.52
## Low_wage_jobs             66.12
## ShareWomen                41.32
## Women                     30.58
## Sample_size               29.75
## Major_category_Sciences   24.79
## Non_college_jobs          20.66
## Total                     10.74
## Part_time                  0.00
## Full_time_year_round       0.00
## Full_time                  0.00
## Unemployed                 0.00
## High.Unemployment_Low      0.00
## Major_category_Arts        0.00
## Employed                   0.00
## C5.0 
## 
## 121 samples
##  21 predictor
##   2 classes: 'Over', 'Under.Equal' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 109, 109, 108, 110, 109, 110, ... 
## Resampling results across tuning parameters:
## 
##   model  winnow  trials  Accuracy   Kappa    
##   rules  FALSE   20      0.9209324  0.6407262
##   rules  FALSE   30      0.9145221  0.6129751
##   rules  FALSE   40      0.9143939  0.6021384
##   rules   TRUE   20      0.9144988  0.5592476
##   rules   TRUE   30      0.9144988  0.5592476
##   rules   TRUE   40      0.9144988  0.5592476
##   tree   FALSE   20      0.9243939  0.6600929
##   tree   FALSE   30      0.9210606  0.6490401
##   tree   FALSE   40      0.9224709  0.6456319
##   tree    TRUE   20      0.9179837  0.5743338
##   tree    TRUE   30      0.9179837  0.5743338
##   tree    TRUE   40      0.9179837  0.5743338
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were trials = 20, model = tree and winnow
##  = FALSE.
## Confusion Matrix and Statistics
## 
##              Actual
## Prediction    Over Under.Equal
##   Over           3           0
##   Under.Equal    0          22
##                                      
##                Accuracy : 1          
##                  95% CI : (0.8628, 1)
##     No Information Rate : 0.88       
##     P-Value [Acc > NIR] : 0.04093    
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.00       
##             Specificity : 1.00       
##          Pos Pred Value : 1.00       
##          Neg Pred Value : 1.00       
##              Prevalence : 0.12       
##          Detection Rate : 0.12       
##    Detection Prevalence : 0.12       
##       Balanced Accuracy : 1.00       
##                                      
##        'Positive' Class : Over       
## 

Model Building: Random Forest Classification

## [1] 0.3953488
## 
## LE.EQ.20K     G.50K 
##       104        68
## [1] 121  21
## [1] 25 21
## [1] 26 21
## [1] 4.472136
##    X1.nrow.combined_RF.err.rate.       OOB LE.EQ.20K     G.50K
## 1                              1 0.2982456 0.3666667 0.2222222
## 2                              2 0.2325581 0.2549020 0.2000000
## 3                              3 0.2475248 0.2372881 0.2619048
## 4                              4 0.2110092 0.1904762 0.2391304
## 5                              5 0.2280702 0.1617647 0.3260870
## 6                              6 0.2288136 0.1830986 0.2978723
## 7                              7 0.2000000 0.1506849 0.2765957
## 8                              8 0.2250000 0.1917808 0.2765957
## 9                              9 0.1735537 0.1095890 0.2708333
## 10                            10 0.1900826 0.1232877 0.2916667
## 'data.frame':    121 obs. of  21 variables:
##  $ Total               : int  2339 756 1258 32260 3777 1792 91227 81527 15058 14955 ...
##  $ Men                 : int  2057 679 1123 21239 2110 832 80320 65511 12953 8407 ...
##  $ Women               : int  282 77 135 11021 1667 960 10907 16016 2105 6548 ...
##  $ Major_category      : Factor w/ 4 levels "Sciences","Arts",..: 4 4 4 4 3 1 4 4 4 4 ...
##  $ ShareWomen          : num  0.121 0.102 0.107 0.342 0.441 ...
##  $ Sample_size         : int  36 7 16 289 51 10 1029 631 147 79 ...
##  $ Employed            : int  1976 640 758 25694 2912 1526 76442 61928 11391 10047 ...
##  $ Full_time           : int  1849 556 1069 23170 2924 1085 71298 55450 11106 9017 ...
##  $ Part_time           : int  270 170 150 5180 296 553 13101 12695 2724 2694 ...
##  $ Full_time_year_round: int  1207 388 692 16697 2482 827 54639 41413 8790 5986 ...
##  $ Unemployed          : int  37 85 40 1672 308 33 4650 3895 794 1019 ...
##  $ Unemployment_rate   : num  0.0184 0.1172 0.0501 0.0611 0.0957 ...
##  $ Median              : int  110000 75000 70000 65000 62000 62000 60000 60000 60000 60000 ...
##  $ P25th               : int  95000 55000 43000 50000 53000 31500 48000 45000 42000 36000 ...
##  $ P75th               : int  125000 90000 80000 75000 72000 109000 70000 72000 70000 70000 ...
##  $ College_jobs        : int  1534 350 529 18314 1768 972 52844 45829 8184 6439 ...
##  $ Non_college_jobs    : int  364 257 102 4440 314 500 16384 10874 2425 2471 ...
##  $ Low_wage_jobs       : int  193 50 0 972 259 220 3253 3170 372 789 ...
##  $ Over.50K            : Factor w/ 2 levels "Over","Under.Equal": 1 1 1 1 1 1 1 1 1 1 ...
##  $ High.Unemployment   : Factor w/ 1 level "Low": 1 1 1 1 1 1 1 1 1 1 ...
##  $ combined_target     : Factor w/ 2 levels "LE.EQ.20K","G.50K": 1 1 1 2 2 2 1 1 1 2 ...
## mtry = 4  OOB error = 20.66% 
## Searching left ...
## mtry = 2     OOB error = 19.01% 
## 0.08 0.05 
## mtry = 1     OOB error = 29.75% 
## -0.5652174 0.05 
## Searching right ...
## mtry = 8     OOB error = 14.88% 
## 0.2173913 0.05 
## mtry = 16    OOB error = 9.09% 
## 0.3888889 0.05 
## mtry = 20    OOB error = 11.57% 
## -0.2727273 0.05

## 
## Call:
##  randomForest(x = x, y = y, mtry = res[which.min(res[, 2]), 1]) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 16
## 
##         OOB estimate of  error rate: 12.4%
## Confusion matrix:
##           LE.EQ.20K G.50K class.error
## LE.EQ.20K        65     8   0.1095890
## G.50K             7    41   0.1458333
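The mtry search and final fit shown above are consistent with randomForest::tuneRF followed by a refit at the best value; a sketch of that process (the variable names train_x and train_y are assumptions):

```r
library(randomForest)

# Search for the best mtry, starting from the default and stepping by a
# factor of 2, keeping any step that improves OOB error by at least 5%
res <- tuneRF(x = train_x, y = train_y,
              stepFactor = 2, improve = 0.05, trace = TRUE)

# Refit at the mtry with the lowest OOB error
best_rf <- randomForest(x = train_x, y = train_y,
                        mtry = res[which.min(res[, 2]), 1])
```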

Tuning

Because the built-in random forest method in the caret library does not expose all three hyperparameters identified above, a custom random forest classification method was created for caret so that mtry, sampsize, and ntree could all be tuned together.
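caret's built-in "rf" method only tunes mtry, so exposing sampsize and ntree requires a custom model specification. A minimal sketch of such a specification follows caret's custom-model interface (the exact code used in the report is not shown; train_x and train_y are assumed object names):

```r
library(caret)
library(randomForest)

# Custom caret model that exposes mtry, sampsize, and ntree as tunable
rf_custom <- list(
  type    = "Classification",
  library = "randomForest",
  loop    = NULL,
  parameters = data.frame(
    parameter = c("mtry", "sampsize", "ntree"),
    class     = rep("numeric", 3),
    label     = c("mtry", "sampsize", "ntree")),
  grid = function(x, y, len = NULL, search = "grid") {
    expand.grid(mtry = 3:5, sampsize = c(50, 100, 200),
                ntree = c(200, 300, 400))
  },
  fit = function(x, y, wts, param, lev, last, weights, classProbs, ...) {
    randomForest(x, y, mtry = param$mtry, sampsize = param$sampsize,
                 ntree = param$ntree)
  },
  predict = function(modelFit, newdata, submodels = NULL)
    predict(modelFit, newdata),
  prob = function(modelFit, newdata, submodels = NULL)
    predict(modelFit, newdata, type = "prob"),
  sort = function(x) x)

# Tune by ROC with 5-fold CV repeated 5 times
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 5,
                     classProbs = TRUE, summaryFunction = twoClassSummary)
rf_tuned <- train(x = train_x, y = train_y, method = rf_custom,
                  metric = "ROC", trControl = ctrl)
```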

Now, we can set the hyperparameter values to try and tune the model.

##    .mtry .sampsize .ntree
## 1      3        50    200
## 2      4        50    200
## 3      5        50    200
## 4      3       100    200
## 5      4       100    200
## 6      5       100    200
## 7      3       200    200
## 8      4       200    200
## 9      5       200    200
## 10     3        50    300
## 11     4        50    300
## 12     5        50    300
## 13     3       100    300
## 14     4       100    300
## 15     5       100    300
## 16     3       200    300
## 17     4       200    300
## 18     5       200    300
## 19     3        50    400
## 20     4        50    400
## 21     5        50    400
## 22     3       100    400
## 23     4       100    400
## 24     5       100    400
## 25     3       200    400
## 26     4       200    400
## 27     5       200    400
## 121 samples
##  19 predictor
##   2 classes: 'Over', 'Under.Equal' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 5 times) 
## Summary of sample sizes: 97, 97, 97, 96, 97, 97, ... 
## Resampling results across tuning parameters:
## 
##   mtry  sampsize  ntree  ROC        Sens       Spec     
##   3      50       200    0.9903810  0.8533333  1.0000000
##   3      50       300    0.9910159  0.8300000  1.0000000
##   3      50       400    0.9910159  0.8266667  1.0000000
##   3     100       200    0.9913333  0.8500000  1.0000000
##   3     100       300    0.9871429  0.8266667  1.0000000
##   3     100       400    0.9910159  0.8300000  1.0000000
##   3     200       200    0.9910159  0.8300000  1.0000000
##   3     200       300    0.9897460  0.8400000  1.0000000
##   3     200       400    0.9903492  0.8400000  0.9980952
##   4      50       200    0.9913333  0.8633333  1.0000000
##   4      50       300    0.9916508  0.8666667  1.0000000
##   4      50       400    0.9910159  0.8533333  1.0000000
##   4     100       200    0.9909841  0.9000000  1.0000000
##   4     100       300    0.9897143  0.8666667  1.0000000
##   4     100       400    0.9916508  0.8766667  1.0000000
##   4     200       200    0.9906984  0.8766667  1.0000000
##   4     200       300    0.9925873  0.8666667  1.0000000
##   4     200       400    0.9929206  0.8533333  1.0000000
##   5      50       200    0.9916508  0.9233333  1.0000000
##   5      50       300    0.9910159  0.8966667  1.0000000
##   5      50       400    0.9922857  0.8966667  1.0000000
##   5     100       200    0.9903810  0.8966667  1.0000000
##   5     100       300    0.9916508  0.8866667  1.0000000
##   5     100       400    0.9916508  0.9100000  1.0000000
##   5     200       200    0.9922857  0.8866667  1.0000000
##   5     200       300    0.9910159  0.8866667  1.0000000
##   5     200       400    0.9916508  0.8633333  1.0000000
## 
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 4, ntree = 400 and sampsize
##  = 200.

Evaluation

# Evaluation of Model
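A typical evaluation of the tuned model would score it on the held-out test set (the object names rf_tuned, test_x, and test_y are assumptions):

```r
library(caret)

# Score the tuned random forest on the held-out test set and summarize
test_pred <- predict(rf_tuned, newdata = test_x)
confusionMatrix(test_pred, test_y, positive = "Over")
```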

Fairness Assessment

I believe that our model is fair and accounts for the single protected class present in our dataset: women. Our dataset has a variable, ShareWomen, that records the share of each major's workforce held by women. Because of this, the models we create can surface whether women are being treated fairly in the workplace. For example, our exploratory correlation analysis showed that majors with a larger share of women tend to have lower median salaries. If our dataset did not have the ShareWomen variable, our models would not be able to examine whether women are being paid fairly.

Conclusion

Given the limitations of our models, the results still point to a clear answer to the business question: STEM majors are far more likely than any other category to have median salaries above $50,000, and both the C5.0 model (88.5% test accuracy) and the tuned random forest (resampled ROC above 0.99) can identify those majors reliably. Colleges looking to maximize alumni giving should therefore direct investment toward their STEM programs, while keeping in mind the small sample (172 majors) and the age of the underlying data.

Future Recommendations

One additional piece of analysis that would benefit the report as a whole is using more recently recorded data. The data used in this analysis was recorded from 2010-2012, so the trends discovered here are likely outdated. Newer data would greatly benefit the university requesting this report, as it could adjust funding across major categories based on current trends rather than decade-old ones. Another addition that would benefit our report is a standalone decision tree model. Our analysis included boosted-tree and random forest models, but we never examined a single decision tree, which, being a greedy algorithm by nature, picks the locally best split at every node and yields a highly interpretable model. Including a decision tree would have made our analysis more diverse and well-rounded, as we would have performed the analysis using three different major analytic methods. Personally, we don't believe anything limited our analysis on this project: the dataset was easy to work with, and the models we created learned the data efficiently and effectively.